Craftcans.com provides a database of 2692 canned craft beers. The data on the beers includes several variables, among them the can size (SIZE), the alcohol by volume (ABV) and the bitterness (IBUs).
However, some of the variables mix numbers and text (e.g. SIZE), while others contain missing values (e.g. IBUs). To make the dataset ready for analysis, we first need to clean it. We will do that using pandas and regular expressions.
In [2]:
import pandas, re
In [3]:
data = pandas.read_excel("craftcans.xlsx")
In [4]:
data.head()
Out[4]:
As can be seen above, the real header row was read in as an ordinary observation. Let's rename the columns using those values.
In [5]:
data.columns = data.iloc[0]
In [6]:
data.head()
Out[6]:
Fine, but the very first row still remains. We have to drop it, and for that we will use the drop() function from pandas, which takes two arguments: the label of the row/column to drop and the axis (0 for rows, 1 for columns).
In [7]:
data = data.drop(0,axis=0)
In [8]:
data.head()
Out[8]:
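As a side note, those two steps can often be avoided at read time: read_excel() accepts a header argument that tells pandas which row of the sheet holds the column names. A minimal sketch, assuming the real header sits in the second row of craftcans.xlsx (the exact row number depends on your copy of the file):
# read the file again, treating the second sheet row as the header
data = pandas.read_excel("craftcans.xlsx", header=1)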
Let's do the same for the row names. Row labels form the index in pandas. Thus, let's take the values from the "ENTRY" column and use them to rename the rows. Then, of course, we should drop the now-redundant column too.
In [9]:
data.index = data["ENTRY"]
In [10]:
data.head()
Out[10]:
In [11]:
data = data.drop("ENTRY",axis=1)
In [12]:
data.head()
Out[12]:
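For reference, the two cells above (moving ENTRY into the row labels and then dropping it as a column) could also be done with a single set_index() call. This is just an equivalent sketch, not what was run here:
# use the ENTRY column as the index and remove it from the columns in one step
data = data.set_index("ENTRY")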
Nice, now let's clean some variables, starting with SIZE. It stores the can size, written with "oz", "ounces" or other variants of the unit, while we need the numbers only. Let's first see what the available options are. For that purpose, we can convert the column to a list and then use the set() function to get its unique values.
In [13]:
data_list = data["SIZE"].tolist()
unique_values = set(data_list)
print(unique_values)
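By the way, the same check can be done without leaving pandas: Series.unique() returns the distinct values of a column directly, so the list/set round trip is optional.
# equivalent check using pandas directly
print(data["SIZE"].unique())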
Excellent. This means we can write a regular expression that keeps the numeric part (including values with a decimal point) and substitutes whatever comes after it with an empty string.
In [14]:
# keep only the leading number in each SIZE value and drop the unit text after it
for i in range(0, len(data['SIZE'])):
    data['SIZE'][i] = re.sub(r'(^.*\d)(\s*.*$)', r'\1', data['SIZE'][i])
In [15]:
data.head()
Out[15]:
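If you prefer to avoid the explicit loop, the same substitution can be written in a vectorized form with the .str accessor. A rough sketch, assuming the column holds strings and that your pandas version supports the regex flag of str.replace():
# vectorized alternative to the loop above: keep everything up to the last digit
# and drop the unit text ("oz", "ounces", ...) that follows it
data['SIZE'] = data['SIZE'].astype(str).str.replace(r'(^.*\d)(\s*.*$)', r'\1', regex=True)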
Done! Let's now move on to the ABV variable. It is given in percentages, so we can keep the number only and divide it by 100 to get a float value. But the column may also contain some incorrectly entered values, so let's convert only the values that match the expected pattern and assign a missing value to the rest.
In [17]:
# for all values in that column
for i in range(0, len(data['ABV'])):
    # if a match exists, which means it is a correct value
    if re.match(r'(^.*\d)(%)', data['ABV'][i]) is not None:
        # substitute the % sign with nothing, convert the result to float and divide by 100
        data['ABV'][i] = float(re.sub(r'(^.*\d)(%)', r'\1', data['ABV'][i])) / 100
    else:  # which is when the value is incorrect
        # give it the value of "nan", which stands for a missing value
        data['ABV'][i] = float("nan")
In [18]:
data['ABV'].head(100)
Out[18]:
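Had we wanted to skip the loop here as well, the raw ABV column could have been cleaned in one vectorized expression. A hedged sketch using to_numeric(), which turns anything that is not a valid number into NaN for us (this assumes it is applied to the original, still-uncleaned column):
# vectorized alternative: strip the trailing "%", let to_numeric() turn
# malformed entries into NaN, then rescale to a 0-1 float
data['ABV'] = pandas.to_numeric(data['ABV'].astype(str).str.rstrip('%'), errors='coerce') / 100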
Great! Let's now get some info on our dataframe.
In [19]:
data.info()
As you can see, the ABV variable is left with only 2348 values out of 2410, as we assigned "nan" to the incorrect ones. Let's impute those missing values. As it is a numeric variable, we can impute with the mean using the fillna() function from pandas.
In [20]:
data['ABV'] = data['ABV'].fillna(data['ABV'].mean())
In [21]:
data.info()
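To double-check that the imputation worked, counting the remaining missing values in the column should now return zero:
# sanity check: no missing values should remain in ABV after the imputation
print(data['ABV'].isnull().sum())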
Done! But there is another variable with missing values: IBUs. Let's impute that one as well, but this time, instead of the mean, let's use the backward filling method, which fills each gap with the next observed value in the column.
In [22]:
data['IBUs'] = data['IBUs'].fillna(method = "bfill")
In [23]:
data.info()
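One caveat with backward filling: if the very last observations of the column are missing, there is nothing below them to copy from, so some NaNs may survive. A common pattern, sketched here rather than taken from the original notebook, is to chain a forward fill after the backward fill:
# backward fill first, then forward fill whatever gaps remain at the end of the column
data['IBUs'] = data['IBUs'].fillna(method="bfill").fillna(method="ffill")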